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Abstract 

In this paper, a new family of parallel schemes for directly solving linear 
systems will be presented and analyzed. It is shown that these schemes 
exhibit a near optimal performance and enjoy several important features: 

1. For large enough linear systems, the design of the appropriate parallel 
algorithm is insensitive to the number of processors as its performance 
grows monotonically with them. 

2. It is especially good for large matrices, with dimensions large relative 
to the number of processors in the system. In this case, it achieves 

• an optimal speed up. 

• an optimal efficiency. 

• a very low communication time complexity. 

3. It can be used in both distributed parallel computing environments 
and tightly coupled parallel computing systems. 

4. This set of algorithms can be mapped onto any parallel architecture 
without any major programming difficulties or algorithmical changes. 


♦Work funded by NASA Space Act Agreement C-99066-G. 


1 Introduction 

Solving a linear system Ax = b for i € R n , where A € R n x R n is an n x n 
square matrix and b € R n is a known vector, is a very basic computing 
operation in many scientific fields. It is known that the sequential time 
complexity of a direct procedure for computing x is 0(n 3 ). This time is 
intolerable when n is large. Therefore a practical parallel algorithm to speed 
up the computation is almost a must. Dongarra, Gustavson and Karp [1] 
have analyzed various ways for implementing Gaussian elimination (GE) 
with implications for several architectures. They have pertubed the three 
sequential nested loops into six different loop variations. George, Heath 
and Liu [2] the later formed these six methods for self-scheduling a pool of 
tasks (for the Cholesky algorithm in particular). The “submatrix” version 
is the closest method to the algorithms that will be presented in this paper, 
as they have a “wave-front” like structure. Recently, Lin and Zhang [3] 
and Zhang and Lin [4] have presented a class of efficient parallel algorithms 
for solving a linear triangular system and parallel algorithms for the LU 
decomposition of a matrix, which can be applied easily on any parallel 
machine or distributed environment. Fortunately, it turns out that these 
algorithms are part of a more general family of algorithms to solve a linear 
system of equations. Using somewhat similar concepts, an efficient and 
practical parallel algorithm for solving such a system will be developed in 
the present paper. It is obvious that by combining these two algorithms, it 
is possible to get a parallel algorithm for solving the linear system Ax = b. 
However, although this method can be programmed and handled fairly 
easy, this algorithm turns to be not the best one in this family of parallel 
algorithms, as there exist algorithms that mix these two steps, resulting in 
a larger speed-up. There are two important factors needed to be considered 
when trying to design a useful parallel algorithm: one is communication and 
the other is the processor waiting time. The communication time is defined 
to be the time spent on transferring information between the processors or 
a processor and a memory. The waiting time is defined to be the processor 
idle time while waiting for the necessary information elements from the 
other processors. The main goal in designing the parallel algorithm is two- 
fold: 
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• to minimize the above two parts of the computing time while using a 
small amount of processors 

• the number of processors should not depend on the input size n. 


In the rest of the paper, it will be assumed that the number of available 
processing elements (PE) in the parallel system is P+l, which are indicated 
by PEo, PE\, ■ • • , PEp. The topology of the parallel model consists of P 
PEs connected in a ring like shape, i.e. PEi is connected to the P2£,_i 
and PE i+x for all 1 < t < P, where PE 0 serves as a master processor as it 
may have a direct connection to each of the processors in the system. The 
later connection is not a must for the present analysis but it is convenient 
for understanding the algorithm. The computational model of the parallel 
machine is of the loosely coupled system where all the information is passed 
via the interconnection network and a shared memory is used very seldom. 
Thus, for the shared memory, one can assumes whatever reasonable model 
(coomon bus or a special private connection). Each PE can be: 

• a regular scalar processor. 

• a vector processor. 

• another (smaller) parallel computer. 


2 The Parallel Algorithm Setup. 

Generally speaking, this algorithm consists of two global steps where each 
of them is executed in a parallel manner. For the first step we will use the 
following two positive sets of integers K = {fc,}™ p 2 and N = {n,}™ p 2 which 
are related to each another by 


m — n,_x — ki 

and where m p will be defined later as the number of phases that the first step 
of this algorithm undergoes till its final termination. Also it is convenient 
to append to the matrix A by adding to it an n + 1 column which is equal 
to the right hand side vector 6. 
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In the begining of the algorithm, as a preparation for the first step, the 
elements of the matrix A are distributed among the processors in the row- 
wise direction: PEq gets the first k x + kj rows of the matrix, while each of 
the other processors i gets n ~* * ~* a rows of the matrix A , from row number 
ki + kt + s= f L (i — 1) + 1 to ki -f fa + For convenience, let us define by 

SET{s,t) the set of equations from equation number s to equation number 
t. Also let us define the set of rows that PE{ has in the m th phase of the 
first step of the algorithm as follows: 

SET™ = SET(k m + k m+1 + — (t - 1) + 1 , k m + k m+1 + — 0 (1) 

P P 

As it was mentioned before, the suggested parallel algorithm consists of 
two basic steps: the forward step and the backward step, where both are 
similar to the forward and backward steps of the scalar GE procedure. 
The parallel algorithm for the forward step is described in table 1 which 
sometimes called the parallel activity table for this step. Here r < ki is 
a positive integer selected by the user, and usually depends also on the 
characteristics of the parallel machine. Intuitively, if the parallel machine 
is homogeneous, then we will try to keep r to a minimum (r — l). However, 
in a hetrogeneous environment, r will probably relate to the relative powers 
of the processors. 

The resultant matrix of this step has the structure of a "staired like” 
upper triangular matrix. The parallel algorithm for the backward step is 
based on the parallel algorithm for triangular system devised by Lin and 
Zhang [3]. The sequential algorithm for triangular systems may be written 
as follows: 


b i 

x x = ; x k = 

an 


h_ 

&kk 


E 


»=i 



&kk 


k = 2, 3, • • • ,n. 


( 2 ) 


As in the first step, the idea is to let each PE perform the same amount 
of computations between two successive communication operations, and to 
make the redundant computations as small as possible. Those redundant 
computations results from the fact that the algorithm is parallelized, and 
some of the operations are needed to be executed not only by one pro- 
cessor (as in the scalar case) but by several of them (whichever needs the 
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Phase 

number 


m > 2 


Processor 0 

Processor t, 1 < * < p 

1. Executes the GE algo- 

1. Receives the values of the 

rithm for the first r vari- 
ables. 

first r variables. 

2. Substitutes the values of 

2. Transmits the results of 

the first r variables into its 

the r variables to the rest 
of the processors. 

3. Executes the GE algo- 
rithm for the next k\ — r 
equations. 

set of SET^ equations. 

1. Substitutes the values of 

• Substitutes the values of 

the last k m _i variables. 

2. Executes the GE algo- 
rithm for the next k m 
equations. 

3. Transmits the results of 
those k m variables to the 
rest of the processors. 

k m _ 1 — 6m,2 r variables into 
its SET™ equations. 

Transfers to the i — 1 processor ^(p — i + 1) of its first 

equations, and receives ^“•(p — t 

rows from the i + 1 processor 

locating them at the end of its set of rows. 
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appropriate results). This step is executed in + 1 phases. Similar to step 
1, let us denote by Q = {?i}<=x a sequence of positive integers, such that 
qi < n. Then a possible parallel backward step is summarized in the 
parallel activity table 2. 

3 The Main Properties of the Algorithm. 

One of the more important features of this parallel algorithm is that all the 
processors are working all the time towards the solution of the problem. 
More than that. It is possible to minimize the waiting time for needed 
information or data for each of the processors. This means, that the exe- 
cution time of all of the processors between two successive communication 
operations (independent on the task that they ought to perform), will be 
the same (for all of them). In order to set up the equations for this time- 
balance requirement, let us define first the following quantities: 

• GE(a,j3 ) - is the time that it takes for one processor to execute the 
Gaussian elimination algorithm for a variables with /3 right hand 
sides. 

• TR(a, (3) - is the time that it takes to transfer a vectors each of length 
of /? to the neighbor processor. 

• REC(a,(3) - is the time that it takes to receive a vectors each of the 
length of (3 by the neighbor processor. 

• MU LT(rl,r2, r3) - is the time that it takes to multiply two matrices, 
one of the sixe of rl x r2 and the other of the size r2 x r3. 

• SUBT(rl,r2) - is the time that it takes to subtract two matrices one 
from the other where their size is rl x r2. 

In terms of the above definitions, if it is desired that the execution time of 
PE 0 will be equal to the execution time of any of the other P Es in each of 
the phases of the forward step. This means that for the first phase of this 
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Phase 

num- 

ber 

Processor 0 

Processor *, 1 < » < p 

1 

1. Executes the scalar algorithm 
for the first r variables. 

2. Transmits the results of the r 
variables to the rest of the pro- 
cessors. 

3. Executes the scalar algorithm 
for the next q\ — r equations. 

1. Receives the values of the first 
r variables. 

2. Substitutes the values of the 

first r variables: bj <— b } — 

Sower r element* a i j x i where q\ + 

1 + [(t - 2)^1 < j < qt + 

i + [(.-- i)^l. 

3. Transmits back the new values 
of 6,’s to PEq. 

m > 2 

1. Calculates the values of the 
next q m variables, using the 
latest values of the bs. 

2. Transmits the results of those 
q m variables to the rest of the 
processors. 

• Substitutes the values 

of q m - 1 — ^m, 2 r variables into 
its set of equations. 

• Reports back to PEq the new 
values for the bs. 

m — 
T p + 1 

finishes to solve in a sequential mode 
the rest of the equations. 



Table 2: Parallel Activity Table for the Backward Step 
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step we have the following balance: 


GE(r,n — r) + TR[r,n — r)+ REC(r,n — r)+ 

MULT{k 1 -r,r,n-r)+ = MULT{*f,r,n - r)+ (3) 

. SUBTfa-r^J + GEfa-r,^) J [ SUBT{^-,n - r) 

Similarly for the m th phase of this step and any processor t, the following 
time balace holds: 

MULT(k m ,k m -i,n m -i)+ MULT{^,k m -\ — i m , 2 ,n m _i)+ 

SUBT{k m ,n m _ 1 )+ = SUBT{^,n m _ 1 )+ 

GE(k m ,n m )+ T’J?(^ l (p — t ' + l),n m+1 )+ 

TR(k m ,n m ) + REC(k m+ i,n m ) J [ REC(^-{p - i),n TO+ i) 

( 4 ) 

Once we are given the appropriate execution time for any of the above 
operations, those equations determine the sets of integers K and N . These 
operations depend very much on the basic features of each processor in 
the parallel engine. In what follows we will assume that the processors 
are regular and standard scalar processors, while in the extended version 
of this paper we will present also results for vector processors and parallel 
processors (means that the engine is of the degenerate parallel system type 
like a paralleld-vectored machine). 

For a scalar processor, let us define the CPU times needed to execute an 
addition, a multiplication and to find the inverse of a scalar number by A , M 
and I respectively. Also we will assume the following time complexities: 

• if the LU decomposition approach is used than 

GE{r,s ) = M^[(s + l)r 2 + (3s + l)r-s] 

+A-[2{s + l)r 2 + 3(s - l)r + (1 - 5s)] + Isr. (5) 
6 

If the regular elimination is used, than 

r 2 3f 5 

GE{r, s) = {M + A)r{- + — + s 2 ) + Mr{r + 2s) (6) 

O It 
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• MULT{r,t, s) = {A + M)rts. 

• SUBT(r,s) = Ars. 

where the contributions of the broadcasting time complexities will be dis- 
cussed in the extended version of the paper. 

Lets define also by a and (3 the following quantities: 

a = A/M ; /? = J/M (7) 

It is obvious that a < 1 and /3 > 1 where for most of the machines /? 
is upper bounded roughly by 10. Then the above balance of the execution 
time between the master processor, PEq and the rest of the processors can 
be expressed as: 

dok^ + dik^ + a 2 k 2 m + a 3 k m + a 4 = 0 (8) 

where the LU decomposition version for the GE operation eq.(l5) was used. 
The coefficients of this equation for k m are given as follows: 


a 0 = 

1 + a 



(9) 

a i = 

~n m -i 

+ 3 < 1+ 2> 


(10) 

a 2 = 

-3(1 + f )»m-l + (1 + \<* 

+ 3/S) 

(11) 

a 3 = 

-«m- 1 

[3 (k m - x + a) 

P 

- (1 + - 3/3)) 

(12) 

a 4 = 

3 P+1 

{km - 1 “1“ £*) n m-l 


(13) 


Based on this model for the GE procesure of PEq we have the following 
results for the forward step: 

Lemma 1: For large values of n m , k m = ” a ‘ + 0(1). 

Proof: Assume a solution of the singular pertubation type k m = + e 

where e is a weaker function then the leading term in this expression, then 
it is possible to show that 

3a(l + 0.5a) + [(1 + 2.5a)(2 + a) + 3 a(3 - 3^a(l + a){k m ^ l + a)](^) 
C = 1 + 3(1 - 2a)(l +0.5a)£^ 
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□ 


In addition, it is possible to find lower bounds for the values of the 
elements of the sequence K as is given by the following lemma: 

Lemma 2: The smallest value of n m _i for which a real solution exists 
for k m is approximately 3a (1 + a) 4 . For this case, the value of k m may be 
approximated by (1 + a)[3a(l + a)fc m _x]s. 

This lemma, which can be proven easily, leads to the following upper 
bound on the number of phases that the the forward step requires till its 
termination. 


Lemma 3: The maximum number of phases is upper bounded by 0(C log n) 
where C « log(l + a) 

Finally, based on the above lemmas, we can prove the following upper 
bound for the speed-up of the forward step only: 


Theorem 1: The maximum speed-up of the forward step cannot exceed 
O(logP). 

We should realize that this result is for the version of GE(r,s ) given 
by eq. (5) and is modeled in eq. (8). The origin for this poor theoretical 
speed-up is probably in the bad algorithm for the scalar GE procedure. 
When using eq.(6) for this procedure, the following recursive equation for 
the integer set K is obtained: 

+ a-ikh + a 2 k m + a 3 = 0 (14) 

where the coefficients a, are given as follows: 

a 0 = -(1 4- a) (15) 
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a x = 

3 . 2 . 
- - (i + a)(n m _ x + -) 

(16) 

a 2 = 

-n m ~i[k m -i + 1 + + «)] 

(17) 

a s = 

^(k m -i + a)n^_ x 

(18) 


In the following lemma we try to isolate the proper root for eq.(14). 


Lemma 3: Eq.(l 4 ) has two positive roots for k m one which is larger than 
n m _ i, k$ > n m _ i and one which is smaller than n m _ i, k$ < n m _ x . 

The proof for this lemma is simple, and one of the outcomes of this 
proof is that asymptotically we can approximate the first root as is given 
in the following lemma: 

Lemma 4 : The largest root of eq.(14) is given by 

= \JiM~p + i)*=^i + « (is) 

where e can be approximated by 0(n^_ x ) where -7 << 0.5. 

Using this result, it is possible to isolate the second positive root, as it 
is given in the following lemma: 

Lemma 5: A:^_ x ~ (1.5 2 ^ ^/c m - X n^_ x ) *. 

Although the proof for this lemma involves lengthy algebra, it is quite 
straight forward. The following table gives some numerical results for the 
set K for n = 1000000 , Jfc x = 4 and r = 1 , for various number of processors. 
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P and phase number 

i=l 

i=2 

i=3 

i=4 

i=5 

i=6 

i=7 

2 

1 

36042 

483121 

480835 
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i=l 

i=2 

i=3 

i=4 

i=5 

i=6 

i=7 

1 

24102 

294872 

421064 

259909 



100 

i=l 

i=2 

i=3 

i=4 

i=5 

i=6 

i=7 

1 

13554 

144757 

232322 

205233 

146224 

95921 

i=8 

wsm 

i=10 

S5M 


i=13 

i=14 

60904 

38170 

23979 

14804 

9202 

5718 

3553 

ism 

ssm 

om 

QB 


i=20 

wm 

2207 

1371 

851 

528 

328 

203 

126 


The first phenomena that can be observed from this table is that the 
number of phases till termination of the forward step is an increasing func- 
tion of P. In fact we have 

Lemma 6: The number of phases is the forward step is upper bounded 
by 0(3 log P) plus a very weak function of log 

It should be noted that the forward step is terminated for m = m p 
where k mf > n mr . From this condition we may have also some intuition to 
what would be an upper bound for the speed-up of this step. Since the the 
last phase is a scalar phase, as it is executed only by PEq, then we have: 


Lemma 7: The speed-up of the forward step is upper bounded by P — 



In the forthcoming paper we will give some more results which are not 
as important as the above, for example, approximations for the elements 
of if as function of n, Jfci and P. Those results gain some importance when 
this method is implemented on an actual parallel machine. 
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For the backward step, the following equations presents the relations 
between the successive elements in the sequence Q. For the first phase, the 
balance of the CPU time is: 

9i(9i + !) ( n ~ 9i) ( 20 ) 

2 P 


and the balance for the phase m is: 

, 9m (9m + 1) 
9m 9m— 1 T 0 


9m- 1 

P 


(n 


m— 1 

- Yl 9 . ) 

1 


( 21 ) 


Also the rows of A matrix that correspond to those ft^’s are transferred 
to the PE 1. The above equations for the backward step was solved nu- 
merically for the case of P = 2, and the results are given in the following 


table: 


n 

i 

1 

2 

3 

4 

5 

6 

10 4 

9 . 

139 

1400 

2814 

2337 

1434 

820 


X 

7 

8 

9 

10 

11 



9 . 

462 

260 

146 

82 

46 


n 

i 

1 

2 

3 

4 

5 

6 

10 s 

9 . 

43 

213 

279 

197 

116 

66 


t 

7 

8 

9 





9 . 

37 

21 
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In the following section we will prove that the running time of this 
procedure for determining the integer sequence Q is also small since the 
number of phases of this step till termination is: rp = 0[ Plog(n) ]. To 
obtain the order of of the number of phases r>, let’s study a slightly different 
recursive system for u, which is defined by the following equations: 


u T 


n — Ui 


Define: iV, 


Ui-iUi + 

p - 1 


u: 


2 P - 1 

U,-x(»- Ey = x tt ») 
P-1 


, {fori > 2) 


( 22 ) 


(23) 


U? 

, then we have iV, + i = u.+x(l + pci + - 2 U~)> an< ^ 
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2 j 

using N i+ i = N f - we get finally u i+1 (l + j*^) + ^ = u, + j^-. 
Define: t f - = then the following equation which is equivalent to eq. 
( 14 ) is obtained: 

| + *•(! + jTTj) - (1 + ^ = 0 ( 24 ) 

Solving the equation ( 24 ) for t if we get t, = yj\ 1 + pry) 2 + 2 + ti_i — (1 + 

pry)- Define: /?, = *, + 1 + pry, we have fa = ^/(pry)(l + pry) + 2 + fa_ x . 
Now the following lemma is obvious. 

Lemma 7 : Lp = lim,,^*, fa = \ + y'f + p^y(l + p^j) 

Now let’s define n = yfl + (6 - pry)y^> ^/[pM 1 + ^ and 

R = log(Lp + e). The following lemma gives an estimate of the speed of 
the convergence of the sequence {fa}. 

Lemma 8: Given e > 0, let N ( = log(log(( 3 2 ) ) -log(R), then fa < Lp + e 
for * > N t . 

Proof: Simply by realizing first that log(Lp + t) — 2 log(n) > 0, and secondly, 
since fa > Lp, then fa < Py/fa-i- 

□ 

From the definition of fa, we easily see now that ti < Lp - (1 + pry) + e 
for i > N t . 

Lemma 9 : u, < uj when t > 0 ( Plog(n ) ). 

Proof: We choose e such that Lp — (1 + pry) + £ < 1 — 7, where 7 is a fixed 
number, and 0 < 7 < 1. Consider the equation Lp — (1 + pry + = 1, 

then e- = 1.5 + ^ - v'f + ehtl + n^) = 0 ( ^ ). If we leti = ± 
and e = 7, then we have t, < 1 - 7 when i > N ( . That is u, +1 < (1 — 7)u, 
when i > N e . From log(R) = 0 ( log(P ) ) and u ,+k < (1 — 7)*u, for i > N c , 
we can easily see the lemma is true. 

□ 


Theorem 2: r p < O(Plogn). 

Proof: By induction it can be shown that ^ < 1 — n~*. Using this result, 
a lemma for fa which is similar to lemma 3 for u, can be formulated, and 
the result follows. 
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□ 


4 Analysis of The Algorithm 

Here we give a simple analysis of the parallel algorithm subject to certain 
models for the communication complexity, see [5] for more information. In 
the final version of the paper, we will consider a broader range of models 
for communication. Let us define the time needed to broadcast a packet 
of information of size N as = a + TpN, where a is start up time, and 
Tp is a function of P. IV = 1 if the information is sent to the immediate 
neighbors, and IV = P if the information is sent to all other P PEs. We 
define T e (n ) as the communication time of the parallel algorithm. 

In the parallel algorithm, PEq has to send its result to all other P processors 
at each phase of the forward step and all the other processors send only 
the result in the worst case to the immediate neighbors. We can easily get 
an upper bound for T c (n). 

Theorem 3: for the forward and the backward steps T c (n) < 0( Pn + 
nj \ 

2(P-1) )’ 

Let’s define by Ti(n) the computational sequential time and by Tp(n) 
the computational parallel time using P PEs. Let Sp(n) = denote 

the speed up and Vp — Sp ^ denote efficiency. 

Definition: It is said that the speed up is optimal when lim^oo Sp(n) = P , 
and the efficiency is optimal when lim n _ f00 rjp (n) = 1. 

Theorem 3: The present parallel algorithm has optimal speed up and 
optimal efficiency. 

Proof: From the result of theorem 2 and Ti(n) — we can easily reduce 
to the results. 

□ 
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